Air Quality Based on Environmental Factors¶

Authored by: Alireza Montazeri

Duration: 90 mins
Level: Intermediate
Pre-requisite Skills: Python, Data Analysis, Machine Learning, Deep Learning (Tensorflow)

Scenario¶

Predicting air quality is crucial for managing urban environments and protecting public health. Developing a model to forecast Air Quality values helps issue warnings when air pollution levels exceed acceptable limits or when harmful conditions arise. This not only supports the work of environmental and health authorities but also contributes to improving city life by identifying key factors that impact air quality. By exploring how various environmental factors, such as temperature, humidity, and wind, influence air quality, we gain deeper insights into the causes of pollution. Incorporating spatial data allows for a more detailed analysis of geographical variations in pollution, guiding city planners and decision-makers in promoting cleaner, more sustainable urban development.

What this use case will teach you¶

At the end of this use case you will:

  • Understand how to clean and preprocess environmental datasets.
  • Perform exploratory data analysis (EDA) to extract key insights.
  • Analyze the importance of different environmental factors in influencing air quality.
  • Conduct spatial analysis to visualize geographical variations in air quality.
  • Perform data wrangling to ensure the dataset is ready for modeling and analysis.
  • Develop and evaluate a machine learning model to predict AQI.
  • Create an implementation framework for real-world application of the air quality prediction model.

Datasets Used¶

  1. Microclimate sensors data: This dataset contains climate readings from climate sensors located within the City. The data is updated every fifteen minutes and can be used to determine variations in microclimate changes throughout the day.(https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensors-data/information/)
  2. Argyle Square Air Quality: Air quality measures in Argyle Square.(https://data.melbourne.vic.gov.au/explore/dataset/argyle-square-air-quality/information/)

These datasets are sourced from the City of Melbourne's open data portal.

Usage¶

In urban environments, PM2.5 and PM10 are considered the primary pollutants due to their direct impact on human health. PM2.5 refers to particulate matter with a diameter of less than 2.5 micrometers, which is small enough to penetrate deep into the lungs and bloodstream, causing serious respiratory and cardiovascular problems. PM10, while larger (up to 10 micrometers), still poses a significant risk, especially to those with existing health conditions. Monitoring these two pollutants is essential for assessing air quality because they are the most prevalent and harmful airborne particles in urban areas, often caused by traffic, construction, industrial activities, and other pollution sources. By focusing on PM2.5 and PM10 as the main features, the model is better equipped to predict dangerous pollution levels and offer early warnings for public health and environmental management.

I used the datasets from microclimate_sensors and argyle_air_quality to train two models—an LSTM (Long Short-Term Memory) and a GRU (Gated Recurrent Unit) model. These models are well-suited for time-series prediction tasks like air quality forecasting, as they are designed to learn from sequential data and capture patterns over time.

Because of the data availability regarding these two datasets, I took two approaches when predicting air quality:

  1. Using only PM2.5 and PM10 history: In the first approach, I focused solely on past values of PM2.5 and PM10 to predict future air quality. This approach is straightforward and highlights how well the model can predict future values based solely on the historical data of these two pollutants.

  2. Using historical environmental features: In the second approach, I incorporated additional environmental factors such as temperature, humidity, and wind speed, which were found to have the most significant impact on PM2.5 and PM10 based on the feature importance analysis. By including these factors, the model takes into account the broader environmental context, which can influence fluctuations in air quality. This enriched dataset allowed the model to better understand how weather conditions affect pollutant levels and improve the accuracy of the predictions.

Implementation¶

Step 1: Data Cleaning and Preprocessing

  • Load microclimate_sensors and argyle_air_quality datasets from City of Melbourne containing environmental measurements from various sensors across different locations.

  • Clean and preprocess the data to handle missing values and ensure consistency.

Step 2: Exploratory Data Analysis (EDA)

  • Generate summary statistics to understand key metrics like the distribution of PM2.5 and PM10 concentrations over time.

  • Plot time series of PM2.5 and PM10 to observe trends, spikes, and potential outliers.

  • Perform geospatial analysis to understand how air quality varies across different locations by mapping the sensor positions and their corresponding PM measurements.

Step 3: Feature Importance Analysis

  • Investigate how environmental factors like temperature, humidity, and wind influence air quality by examining correlations and using decision tree-based models (e.g., Random Forest) to rank feature importance.

  • Visualize the contribution of each environmental factor to the prediction of air quality indices (PM2.5 and PM10) using feature importance plots from the decision tree models.

Step 4: Data Wrangling

  • Identify gaps in data collection using date ranges and visualize periods of low activity, ensuring a better understanding of missing or anomalous data points.

  • Handle missing values using linear interpolation and aggregate data from different sensors by resampling to hourly intervals for consistency.

  • Normalize the dataset using a MinMaxScaler to scale features such as PM2.5 and PM10 between 0 and 1, preparing the data for time-series modeling with LSTM.

  • Create sequences of 24-hour windows of PM2.5 and PM10 data to be used as input for an LSTM/GRU model, with the goal of predicting the next hour's PM2.5 and PM10 values.

Step 5: Model Development and Evaluation

  • Split the data into training, validation, and test sets (70% training, 15% validation, and 15% test) while maintaining the chronological order.

  • Develop and train an LSTM model with normalized data, using sequences of 24-hour windows to predict future air quality values.

  • Evaluate the model's performance using metrics such as Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) on the test dataset to assess prediction accuracy.

Conclusion¶

By following this use case, you will gain a comprehensive understanding of how to leverage data analysis, and machine learning to predict air quality. The actionable insights and recommendations derived from this analysis will aid in improving air quality management and policy-making, contributing to a healthier and more sustainable urban environment.

References¶

  1. City of Melbourne. (2020). Microclimate Sensors Data.
    Available at: https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensors-data/information/

  2. City of Melbourne. (2020). Argyle Square Air Quality Data.
    Available at: https://data.melbourne.vic.gov.au/explore/dataset/argyle-square-air-quality/information/

  3. Zhou, Q., & Zheng, Z. (2020). Air quality forecasting using environmental parameters: A case study.
    Journal of Environmental Management, 269, 110774.

  4. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory.
    Neural Computation, 9(8), 1735-1780.

  5. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
    arXiv preprint arXiv:1412.3555.

  6. Zhang, Y., Li, M., & Wang, T. (2018). PM2.5 forecasting using machine learning models.
    Environmental Science and Pollution Research, 25(19), 19072-19085.

  7. Brownlee, J. (2018). A Gentle Introduction to Long Short-Term Memory Networks (LSTMs) for Time Series Forecasting.
    Machine Learning Mastery. Available at: https://machinelearningmastery.com

  8. Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection.
    Journal of Machine Learning Research, 3, 1157-1182.

Code¶

In [11]:
# Import required modules

import requests
import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import MarkerCluster

import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

Load Datasets¶

In this section, I define a function called Get_Dataset that retrieves data from the API. The function takes a dataset_id as a parameter and returns the dataset in the form of a pandas dataframe. The function performs the following steps:

  • Constructs the API URL based on the dataset_id.
  • Sends a GET request to the API with the specified parameters.
  • If the response status code is 200 (indicating a successful request), the CSV data is read using StringIO and converted to a pandas dataframe.
  • If the response status code is not 200, an error message is printed.
In [2]:
"""
Get unlimited data from the API v2.1

Parameters:
dataset_id (string): dataset if as from city of Melbourne (https://data.melbourne.vic.gov.au/)

Returns:
Pandas Dataframe: Returns the dataset in shape of pandas dataframe
"""
def Get_Dataset(dataset_id): # pass in dataset id
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets'

    format = 'csv'

    url = f'{base_url}/{dataset_id}/exports/{format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        return dataset
    else:
        return (print(f'Request failed with status code {response.status_code}'))

Microclimate Dataset¶

In [3]:
# Get the first dataset
microclimate_sensors_df = Get_Dataset('microclimate-sensors-data')
microclimate_sensors_df.head()
Out[3]:
device_id received_at sensorlocation latlong minimumwinddirection averagewinddirection maximumwinddirection minimumwindspeed averagewindspeed gustwindspeed airtemperature relativehumidity atmosphericpressure pm25 pm10 noise
0 ICTMicroclimate-09 2024-07-17T15:33:32+00:00 SkyFarm (Jeff's Shed). Rooftop - Melbourne Con... -37.8223306, 144.9521696 0.0 300.0 359.0 0.0 0.9 3.5 8.7 86.3 1013.1 1.0 4.0 63.1
1 ICTMicroclimate-03 2024-07-17T15:06:13+00:00 CH1 rooftop -37.8140348, 144.96728 0.0 308.0 349.0 0.0 0.4 1.0 8.5 99.0 1008.7 3.0 5.0 69.7
2 ICTMicroclimate-07 2024-07-17T15:21:33+00:00 Tram Stop 7C - Melbourne Tennis Centre Precinc... -37.8222341, 144.9829409 0.0 262.0 354.0 0.0 0.4 1.6 9.0 85.0 1016.1 0.0 0.0 55.3
3 ICTMicroclimate-08 2024-07-17T15:40:34+00:00 Swanston St - Tram Stop 13 adjacent Federation... -37.8184515, 144.9678474 0.0 339.0 359.0 0.0 0.9 4.3 9.0 83.9 1014.1 1.0 1.0 60.6
4 ICTMicroclimate-02 2024-07-17T15:42:47+00:00 101 Collins St L11 Rooftop -37.814604, 144.9702991 7.0 118.0 261.0 1.4 2.1 4.1 9.0 96.7 1009.4 8.0 11.0 69.0

Argyle Square Air Quality Dataset¶

In [4]:
# Get the second dataset
argyle_air_quality_df = Get_Dataset('argyle-square-air-quality')
argyle_air_quality_df.head()
Out[4]:
time dev_id sensor_name lat_long averagespl carbonmonoxide humidity ibatt nitrogendioxide ozone particulateserr particulatesvsn peakspl pm1 pm10 pm25 temperature vbatt vpanel
0 2020-06-09T09:02:38+00:00 ems-ec8a Air Quality Sensor 2 -37.802772, 144.9655513 56.0 -6448.0 65.0 71.0 287.0 137.0 0.0 151.0 69.0 12.0 19.0 17.0 12.3 3.96 0.00
1 2020-06-09T11:17:37+00:00 ems-ec8a Air Quality Sensor 2 -37.802772, 144.9655513 55.0 -6916.0 68.0 89.0 325.0 156.0 0.0 151.0 62.0 15.0 24.0 22.0 10.9 3.93 0.00
2 2022-05-03T21:46:34+00:00 ems-ec8a Air Quality Sensor 2 -37.802772, 144.9655513 58.0 -6261.0 77.0 169.0 268.0 137.0 0.0 151.0 64.0 0.0 0.0 0.0 15.1 3.76 16.33
3 2020-06-09T11:32:37+00:00 ems-ec8a Air Quality Sensor 2 -37.802772, 144.9655513 55.0 -6916.0 69.0 76.0 325.0 156.0 0.0 151.0 68.0 19.0 29.0 24.0 10.5 3.92 0.00
4 2021-05-15T06:04:33+00:00 ems-ec8a Air Quality Sensor 2 -37.802772, 144.9655513 56.0 -6261.0 51.0 12.0 258.0 119.0 0.0 151.0 62.0 0.0 0.0 0.0 14.9 4.01 18.33

Data Cleaning and Preprocessing¶

Microclimate¶

In this section, I perform data cleaning and preprocessing on the microclimate dataset. I check for missing values, fill them with appropriate values, and perform additional data transformations. I also extract latitude and longitude information from the 'latlong' column. Finally, I use the KNNImputer to fill missing latitude and longitude values with the nearest available values.

In [5]:
print(microclimate_sensors_df.info())
print("\n\nMissing values:")
print(microclimate_sensors_df.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90698 entries, 0 to 90697
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   device_id             90698 non-null  object 
 1   received_at           90698 non-null  object 
 2   sensorlocation        85976 non-null  object 
 3   latlong               85976 non-null  object 
 4   minimumwinddirection  79483 non-null  float64
 5   averagewinddirection  90508 non-null  float64
 6   maximumwinddirection  79483 non-null  float64
 7   minimumwindspeed      79483 non-null  float64
 8   averagewindspeed      90508 non-null  float64
 9   gustwindspeed         79483 non-null  float64
 10  airtemperature        90508 non-null  float64
 11  relativehumidity      90508 non-null  float64
 12  atmosphericpressure   90508 non-null  float64
 13  pm25                  83369 non-null  float64
 14  pm10                  83369 non-null  float64
 15  noise                 83369 non-null  float64
dtypes: float64(12), object(4)
memory usage: 11.1+ MB
None


Missing values:
device_id                   0
received_at                 0
sensorlocation           4722
latlong                  4722
minimumwinddirection    11215
averagewinddirection      190
maximumwinddirection    11215
minimumwindspeed        11215
averagewindspeed          190
gustwindspeed           11215
airtemperature            190
relativehumidity          190
atmosphericpressure       190
pm25                     7329
pm10                     7329
noise                    7329
dtype: int64
In [6]:
microclimate_sensors_df.fillna({
    'sensorlocation': 'Unknown',
    'minimumwinddirection': microclimate_sensors_df['minimumwinddirection'].median(),
    'averagewinddirection': microclimate_sensors_df['averagewinddirection'].median(),
    'maximumwinddirection': microclimate_sensors_df['maximumwinddirection'].median(),
    'minimumwindspeed': microclimate_sensors_df['minimumwindspeed'].median(),
    'averagewindspeed': microclimate_sensors_df['averagewindspeed'].median(),
    'gustwindspeed': microclimate_sensors_df['gustwindspeed'].median(),
    'airtemperature': microclimate_sensors_df['airtemperature'].median(),
    'relativehumidity': microclimate_sensors_df['relativehumidity'].median(),
    'atmosphericpressure': microclimate_sensors_df['atmosphericpressure'].median(),
    'pm25': microclimate_sensors_df['pm25'].median(),
    'pm10': microclimate_sensors_df['pm10'].median(),
    'noise': microclimate_sensors_df['noise'].median()
}, inplace=True)

# Extracting Lat & Long
microclimate_sensors_df[['lat', 'long']] = microclimate_sensors_df['latlong'].str.split(',', expand=True).astype(float)

# Using KNNImputer to fill missing Latitude and Longitude values with the nearest available values
microclimate_sensors_df[['lat', 'long']] = KNNImputer(n_neighbors=5).fit_transform(microclimate_sensors_df[['lat', 'long']])


# Converting the 'Time' column to datetime
microclimate_sensors_df['received_at'] = pd.to_datetime(microclimate_sensors_df['received_at'])

# Displaying the first few rows of the processed microclimate_sensors_df
print(microclimate_sensors_df.info())
print("\n\nMissing values:")
print(microclimate_sensors_df.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90698 entries, 0 to 90697
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   device_id             90698 non-null  object             
 1   received_at           90698 non-null  datetime64[ns, UTC]
 2   sensorlocation        90698 non-null  object             
 3   latlong               85976 non-null  object             
 4   minimumwinddirection  90698 non-null  float64            
 5   averagewinddirection  90698 non-null  float64            
 6   maximumwinddirection  90698 non-null  float64            
 7   minimumwindspeed      90698 non-null  float64            
 8   averagewindspeed      90698 non-null  float64            
 9   gustwindspeed         90698 non-null  float64            
 10  airtemperature        90698 non-null  float64            
 11  relativehumidity      90698 non-null  float64            
 12  atmosphericpressure   90698 non-null  float64            
 13  pm25                  90698 non-null  float64            
 14  pm10                  90698 non-null  float64            
 15  noise                 90698 non-null  float64            
 16  lat                   90698 non-null  float64            
 17  long                  90698 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(14), object(3)
memory usage: 12.5+ MB
None


Missing values:
device_id                  0
received_at                0
sensorlocation             0
latlong                 4722
minimumwinddirection       0
averagewinddirection       0
maximumwinddirection       0
minimumwindspeed           0
averagewindspeed           0
gustwindspeed              0
airtemperature             0
relativehumidity           0
atmosphericpressure        0
pm25                       0
pm10                       0
noise                      0
lat                        0
long                       0
dtype: int64

Argyle Square Air Quality¶

In this section, I perform data cleaning and preprocessing on the Argyle Square air quality dataset. I check for missing values and remove rows where all columns have missing values. I also extract latitude and longitude information from the 'lat_long' column and drop the original column.

In [7]:
print(argyle_air_quality_df.info())
print("\n\nMissing values:")
print(argyle_air_quality_df.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142507 entries, 0 to 142506
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   time             142507 non-null  object 
 1   dev_id           142507 non-null  object 
 2   sensor_name      142507 non-null  object 
 3   lat_long         142507 non-null  object 
 4   averagespl       132660 non-null  float64
 5   carbonmonoxide   132660 non-null  float64
 6   humidity         132660 non-null  float64
 7   ibatt            132660 non-null  float64
 8   nitrogendioxide  132660 non-null  float64
 9   ozone            132660 non-null  float64
 10  particulateserr  132660 non-null  float64
 11  particulatesvsn  132660 non-null  float64
 12  peakspl          132660 non-null  float64
 13  pm1              132660 non-null  float64
 14  pm10             132660 non-null  float64
 15  pm25             132660 non-null  float64
 16  temperature      132660 non-null  float64
 17  vbatt            132660 non-null  float64
 18  vpanel           132660 non-null  float64
dtypes: float64(15), object(4)
memory usage: 20.7+ MB
None


Missing values:
time                  0
dev_id                0
sensor_name           0
lat_long              0
averagespl         9847
carbonmonoxide     9847
humidity           9847
ibatt              9847
nitrogendioxide    9847
ozone              9847
particulateserr    9847
particulatesvsn    9847
peakspl            9847
pm1                9847
pm10               9847
pm25               9847
temperature        9847
vbatt              9847
vpanel             9847
dtype: int64
In [8]:
argyle_air_quality_df.dropna(inplace=True)

# Converting the 'Time' column to datetime
argyle_air_quality_df['time'] = pd.to_datetime(argyle_air_quality_df['time'])

print(microclimate_sensors_df.info())
print("\n\nMissing values:")
print(argyle_air_quality_df.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90698 entries, 0 to 90697
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   device_id             90698 non-null  object             
 1   received_at           90698 non-null  datetime64[ns, UTC]
 2   sensorlocation        90698 non-null  object             
 3   latlong               85976 non-null  object             
 4   minimumwinddirection  90698 non-null  float64            
 5   averagewinddirection  90698 non-null  float64            
 6   maximumwinddirection  90698 non-null  float64            
 7   minimumwindspeed      90698 non-null  float64            
 8   averagewindspeed      90698 non-null  float64            
 9   gustwindspeed         90698 non-null  float64            
 10  airtemperature        90698 non-null  float64            
 11  relativehumidity      90698 non-null  float64            
 12  atmosphericpressure   90698 non-null  float64            
 13  pm25                  90698 non-null  float64            
 14  pm10                  90698 non-null  float64            
 15  noise                 90698 non-null  float64            
 16  lat                   90698 non-null  float64            
 17  long                  90698 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(14), object(3)
memory usage: 12.5+ MB
None


Missing values:
time               0
dev_id             0
sensor_name        0
lat_long           0
averagespl         0
carbonmonoxide     0
humidity           0
ibatt              0
nitrogendioxide    0
ozone              0
particulateserr    0
particulatesvsn    0
peakspl            0
pm1                0
pm10               0
pm25               0
temperature        0
vbatt              0
vpanel             0
dtype: int64

Exploratory Data Analysis (EDA)¶

In this section, I conduct exploratory data analysis on the microclimate dataset.

  • Histogram: I visualize the distribution of key variables using histograms. The plots provide insights into the data distribution and help identify any anomalies or patterns.

  • Pair Plots: I create a pairplot of all numerical variables in the microclimate dataset. The pairplot shows the relationships between variables and includes regression lines for each scatter plot. This plot helps us understand the correlations and dependencies between different variables.

Microclimate¶

In [ ]:
# Setting the aesthetics for the plots
sns.set_theme(style="whitegrid")

# Selecting only the numerical columns for the pairplot
microclimate_float_columns = [col for col in microclimate_sensors_df.select_dtypes(include=['float64']).columns if col not in ['lat', 'long']]

# Plotting the distribution of key variables
fig, axes = plt.subplots(4, 3, figsize=(14, 14))

# Flatten axes array for easy iteration
axes = axes.flatten()

for i, col_name in enumerate(microclimate_float_columns):
    sns.histplot(microclimate_sensors_df[col_name], kde=True, ax=axes[i], color='blue')
    axes[i].set_title(f'Distribution of "{col_name}"')
    axes[i].set_xlabel(col_name)  # Add x-axis label
    axes[i].set_ylabel('Count')  # Add y-axis label

plt.tight_layout()
plt.show()
In [ ]:
pairplot = sns.pairplot(
    microclimate_sensors_df[microclimate_float_columns],
    corner=True,
    kind='reg',
    diag_kind='auto',
    plot_kws={'scatter_kws': {'s': 0.5, 'alpha': 0.1, 'color': 'blue'}, 'line_kws': {'color': 'red', 'linewidth': 1}}
)
pairplot.figure.suptitle('Pairplot of All Numerical Variables')
plt.show()
In [ ]:
# Create a heatmap using the correlation matrix
plt.figure(figsize=(10, 8))
heatmap = sns.heatmap(
    microclimate_sensors_df[microclimate_float_columns].corr(),
    annot=True,  # Display the correlation values
    cmap='coolwarm',  # Color map
    fmt=".2f",  # Format the correlation values to two decimal places
    linewidths=0.5,  # Add some space between the squares
    square=True  # Ensure the cells are square-shaped
)

# Add title to the plot
plt.title('Correlation Heatmap of All Numerical Variables for Microclimate', size=15)
plt.show()

Looking at the correlation heatmap, here's how we can interpret the relationships to select parameters for predicting:

PM2.5:

  • Atmospheric Pressure: Correlation of 0.41, moderately positive. This suggests that it is a good candidate for predicting PM2.5.
  • Relative Humidity: Correlation of 0.25, moderately positive, though weaker than Atmospheric Pressure.
  • Air Temperature: Correlation of -0.32, moderately negative.
  • Gust Wind Speed: Slightly negative correlation at -0.23, but still worth considering due to its moderate strength.
  • Average Wind Speed: Slightly negative correlation at -0.17, but still worth considering due to its nature.

PM10:

  • PM2.5: Correlation of 0.99, extremely high. Since PM2.5 is nearly identical to PM10 in correlation, including PM2.5 in the model is crucial for predicting PM10.
  • Atmospheric Pressure: Correlation of 0.39, moderate positive relationship.
  • Relative Humidity: Correlation of 0.26, positive but weaker.
  • Air Temperature: Slightly negative correlation at -0.32.
  • Gust Wind Speed: Slightly negative correlation at -0.24, but still worth considering.
  • Average Wind Speed: Slightly negative correlation at -0.17, but still worth considering due to its nature.

Feature Importance¶

A Decision Tree is an effective method for identifying important features because it inherently selects features that provide the most information gain at each split during the tree construction process. This makes it particularly well-suited for feature importance analysis, as it ranks features based on their contribution to reducing prediction error. Unlike some models that assume linear relationships between variables, Decision Trees can capture both linear and non-linear interactions between features and the target variable. Moreover, they handle different types of data without requiring normalization and are robust to irrelevant features, focusing only on the most predictive ones. This makes them a powerful tool for feature selection in diverse datasets.

In [ ]:
df_filtered = microclimate_sensors_df[microclimate_float_columns]

# Separate features (X) and targets (y) for PM2.5 and PM10
X = df_filtered.drop(columns=['pm25', 'pm10'])
y_pm25 = df_filtered['pm25']
y_pm10 = df_filtered['pm10']

# Split the data into training and test sets
X_train_pm25, X_test_pm25, y_train_pm25, y_test_pm25 = train_test_split(X, y_pm25, test_size=0.2, random_state=42)
X_train_pm10, X_test_pm10, y_train_pm10, y_test_pm10 = train_test_split(X, y_pm10, test_size=0.2, random_state=42)

# Initialize and train Decision Tree Regressor for PM2.5
dt_regressor_pm25 = DecisionTreeRegressor(random_state=42)
dt_regressor_pm25.fit(X_train_pm25, y_train_pm25)

# Initialize and train Decision Tree Regressor for PM10
dt_regressor_pm10 = DecisionTreeRegressor(random_state=42)
dt_regressor_pm10.fit(X_train_pm10, y_train_pm10)

# Extract feature importance
feature_importance_pm25 = dt_regressor_pm25.feature_importances_
feature_importance_pm10 = dt_regressor_pm10.feature_importances_

# Create a dataframe to show the feature importance for both models
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance_PM25': feature_importance_pm25,
    'Importance_PM10': feature_importance_pm10
}).sort_values(by='Importance_PM25', ascending=False)

# Display the feature importance dataframe
feature_importance_df
Out[ ]:
Feature Importance_PM25 Importance_PM10
8 atmosphericpressure 0.350212 0.335914
6 airtemperature 0.144305 0.153030
7 relativehumidity 0.134141 0.138754
9 noise 0.091778 0.095631
5 gustwindspeed 0.085555 0.094950
1 averagewinddirection 0.068889 0.069287
4 averagewindspeed 0.055288 0.051798
2 maximumwinddirection 0.044588 0.042161
3 minimumwindspeed 0.018007 0.012797
0 minimumwinddirection 0.007237 0.005679

Based on both the correlation analysis and the feature importance from the Decision Tree Regressor, we can combine the two methods to finalize the top 5 features for predicting PM2.5 and PM10.

Decision Tree Feature Importance:

  • Atmospheric Pressure and Air Temperature were the top features for both PM2.5 and PM10 predictions.
  • Relative Humidity and Gust Wind Speed also scored high in importance.
  • Noise showed significant importance in both predictions.

By combining the results from both correlation and feature importance, the final 5 features for predicting PM2.5 and PM10 should be:

  • Atmospheric Pressure: Highest importance from both correlation and decision tree.
  • Air Temperature: Strong feature in both analyses.
  • Relative Humidity: Moderate correlation and strong feature importance.
  • Noise: Significant in feature importance.
  • Gust Wind Speed: Showed moderate importance in the decision tree and some relationship in the data.

Location Analysis¶

In [16]:
# Extract unique Device ID and LatLong data
device_location_data = microclimate_sensors_df[['device_id', 'latlong']].dropna().drop_duplicates()

# Split LatLong into separate Latitude and Longitude columns
device_location_data[['lat', 'long']] = device_location_data['latlong'].str.split(',', expand=True)
device_location_data['lat'] = device_location_data['lat'].astype(float)
device_location_data['long'] = device_location_data['long'].astype(float)

# Create a folium map centered around the average latitude and longitude of the sensors
map_center = [device_location_data['lat'].mean(), device_location_data['long'].mean()]
device_map = folium.Map(location=map_center, zoom_start=15)

# Add a marker cluster for better visualization of multiple device locations
device_marker_cluster = MarkerCluster().add_to(device_map)

# Add markers for each unique Device ID
for index, row in device_location_data.iterrows():
    folium.Marker(
        location=[row['lat'], row['long']],
        popup=row['device_id'],
        icon=folium.Icon(color='green', icon='info-sign')
    ).add_to(device_marker_cluster)

# Add a title (using an HTML element)
title_html = '''
             <h3 align="center" style="font-size:20px"><b>Mico Climate Sensors Locations in the City</b></h3>
             '''
device_map.get_root().html.add_child(folium.Element(title_html))

# Add a custom legend
legend_html = '''
     <div style="position: fixed;
     bottom: 50px; left: 50px; width: 200px; height: 100px;
     background-color: white; border:2px solid grey; z-index:9999; font-size:14px;
     ">&nbsp; <b>Legend</b> <br>
     &nbsp; <i class="fa fa-map-marker fa-2x" style="color:green"></i>&nbsp; Device Location <br>
     </div>
     '''
device_map.get_root().html.add_child(folium.Element(legend_html))

# Display the map
device_map
Out[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Since the sensor locations are fairly close to each other in the dataset, it is reasonable to aggregate the data and compute an average for each sensor across the entire Melbourne city region. By aggregating the sensor readings, we can simplify the analysis and obtain an overall average of air quality or other parameters for the city, rather than focusing on individual, closely located sensors. This aggregation can provide a more generalized view of the environmental conditions in Melbourne without the noise introduced by small variations between sensors in close proximity.

Argyle Square Air Quality¶

In [ ]:
# Setting the aesthetics for the plots
sns.set_theme(style="whitegrid")

# Selecting only the numerical columns for the pairplot
float_columns = [col for col in argyle_air_quality_df.select_dtypes(include=['float64']).columns if col not in ['lat', 'long']]

# Plotting the distribution of key variables
fig, axes = plt.subplots(5, 3, figsize=(14, 14))

# Flatten axes array for easy iteration
axes = axes.flatten()

for i, col_name in enumerate(float_columns):
    sns.histplot(argyle_air_quality_df[col_name], kde=True, ax=axes[i], color='blue')
    axes[i].set_title(f'Distribution of "{col_name}"')
    axes[i].set_xlabel('')  # Remove x-axis label
    axes[i].set_ylabel('')  # Remove y-axis label
    axes[i].set_xlabel(col_name)  # Add x-axis label
    axes[i].set_ylabel('Count')  # Add y-axis label

plt.tight_layout()
plt.show()
In [ ]:
pairplot = sns.pairplot(
    argyle_air_quality_df[float_columns],
    corner=True,
    kind='reg',
    diag_kind='auto',
    plot_kws={'scatter_kws': {'s': 0.5, 'alpha': 0.1, 'color': 'blue'}, 'line_kws': {'color': 'red', 'linewidth': 1}}
)
pairplot.figure.suptitle('Pairplot of All Numerical Variables')
plt.show()
In [ ]:
# Create a heatmap using the correlation matrix
plt.figure(figsize=(10, 8))
heatmap = sns.heatmap(
    argyle_air_quality_df[float_columns].corr(),
    annot=True,  # Display the correlation values
    cmap='coolwarm',  # Color map
    fmt=".2f",  # Format the correlation values to two decimal places
    linewidths=0.5,  # Add some space between the squares
    square=True  # Ensure the cells are square-shaped
)

# Add title to the plot
plt.title('Correlation Heatmap of All Numerical Variables for Argyle Square', size=15)
plt.show()
In [19]:
# Extract unique Device ID and LatLong data
sensor_data = argyle_air_quality_df[['dev_id', 'sensor_name', 'lat_long']].drop_duplicates()

# Split lat_long into separate Latitude and Longitude columns
sensor_data[['Latitude', 'Longitude']] = sensor_data['lat_long'].str.split(',', expand=True)
sensor_data['Latitude'] = sensor_data['Latitude'].astype(float)
sensor_data['Longitude'] = sensor_data['Longitude'].astype(float)

# Create a folium map centered around the average latitude and longitude of the sensors
map_center = [sensor_data['Latitude'].mean(), sensor_data['Longitude'].mean()]
sensor_map = folium.Map(location=map_center, zoom_start=18)

# Add a marker cluster for better visualization of multiple sensors
marker_cluster = MarkerCluster().add_to(sensor_map)

# Add markers for each unique sensor
for index, row in sensor_data.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=row['sensor_name'],
        icon=folium.Icon(color='blue', icon='info-sign')
    ).add_to(marker_cluster)

# Add a title (using an HTML element)
title_html = '''
             <h3 align="center" style="font-size:20px"><b>Argyle Square Sensor Locations</b></h3>
             '''
sensor_map.get_root().html.add_child(folium.Element(title_html))

# Add a custom legend
legend_html = '''
     <div style="position: fixed;
     bottom: 50px; left: 50px; width: 200px; height: 100px;
     background-color: white; border:2px solid grey; z-index:9999; font-size:14px;
     ">&nbsp; <b>Legend</b> <br>
     &nbsp; <i class="fa fa-map-marker fa-2x" style="color:green"></i>&nbsp; Sensor Location <br>
     </div>
     '''
sensor_map.get_root().html.add_child(folium.Element(legend_html))

# Display the map
sensor_map
Out[19]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Method 1¶

Data Wrangling¶

In this approach, the goal is to predict future air quality using the historical data of PM2.5 and PM10 from multiple sensors across different datasets. Each dataset contains readings from various sensors located in different parts of Melbourne. The first step is to resample the data for each sensor to hourly intervals, ensuring that the readings are aligned on a common time grid. Once the data is resampled, the next step is to aggregate the values from all the sensors to get an overall view of the air quality conditions in Melbourne. This involves calculating the median of the PM2.5 and PM10 values across all sensors for each hour. Finally, any missing time points in the data will be estimated using first-degree (linear) interpolation, ensuring that the dataset is complete and ready for further analysis. This processed dataset will then be used to predict future air quality based on the historical values of PM2.5 and PM10.

Microclimate¶

In [ ]:
# Group the data by device_id
sensor_groups = microclimate_sensors_df.groupby('device_id')

# Prepare an empty dictionary to store sequences for each sensor
sensor_sequences = {}

# Iterate over each sensor group and process data independently
for sensor_id, sensor_data in sensor_groups:
  # Select only relevant columns for LSTM model (Time, PM25, PM10)
  sensor_data = sensor_data[['received_at', 'pm25', 'pm10']]

  # Sort by time to ensure proper time series ordering
  sensor_data = sensor_data.sort_values(by='received_at')

  # Set 'Time' as the index (optional, but useful for resampling)
  sensor_data.set_index('received_at', inplace=True)

  # Resample to a consistent interval (e.g., hourly) if necessary
  sensor_data = sensor_data.resample('H').median()

  # Find the missing hours by comparing the full range with your dataset's 'Time' column
  missing_hours = pd.date_range(start=sensor_data.index.min(), end=sensor_data.index.max(), freq='H').difference(sensor_data.index)

  # Columns with missing values and the count of missing values
  missing_values = sensor_data.isnull().sum()
  missing_columns = missing_values[missing_values > 0]

  # Check the missing hours or values
  if len(missing_hours) != 0 and not missing_columns.empty:
      print("There are some missing time or value")

  # Store the sequences for this sensor
  sensor_sequences[sensor_id] = sensor_data

print(sensor_sequences['ICTMicroclimate-09'].info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2674 entries, 2024-05-29 03:00:00+00:00 to 2024-09-17 12:00:00+00:00
Freq: H
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pm25    2673 non-null   float64
 1   pm10    2673 non-null   float64
dtypes: float64(2)
memory usage: 62.7 KB
None

Argyle Square¶

In [ ]:
# Group the data by dev_id
sensor_groups = argyle_air_quality_df.groupby('dev_id')

# Iterate over each sensor group and process data independently
for sensor_id, sensor_data in sensor_groups:
  # Select only relevant columns for LSTM model (Time, PM25, PM10)
  sensor_data = sensor_data[['time', 'pm25', 'pm10']]

  # Sort by time to ensure proper time series ordering
  sensor_data = sensor_data.sort_values(by='time')

  # Set 'Time' as the index (optional, but useful for resampling)
  sensor_data.set_index('time', inplace=True)

  # Resample to a consistent interval (e.g., hourly) if necessary
  sensor_data = sensor_data.resample('H').median()

  # Find the missing hours by comparing the full range with your dataset's 'Time' column
  missing_hours = pd.date_range(start=sensor_data.index.min(), end=sensor_data.index.max(), freq='H').difference(sensor_data.index)

  # Columns with missing values and the count of missing values
  missing_values = sensor_data.isnull().sum()
  missing_columns = missing_values[missing_values > 0]

  # Check the missing hours or values
  if len(missing_hours) != 0 and not missing_columns.empty:
      print("There are some missing time or value")

  # Store the sequences for this sensor
  sensor_sequences[sensor_id] = sensor_data

print(sensor_sequences['ems-ec8a'].info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35421 entries, 2020-06-09 08:00:00+00:00 to 2024-06-24 04:00:00+00:00
Freq: H
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pm25    26849 non-null  float64
 1   pm10    26849 non-null  float64
dtypes: float64(2)
memory usage: 830.2 KB
None
In [ ]:
# print the of sensor_sequences with start and end time and number of records

for sensor_id, sensor_data in sensor_sequences.items():
  print(f"Sensor ID: {sensor_id}")
  print(f"Start Time: {sensor_data.index.min()}")
  print(f"End Time: {sensor_data.index.max()}")
  print(f"Number of Records: {len(sensor_data)}")
  print("-" * 20)
Sensor ID: ICTMicroclimate-01
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-02
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-03
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-04
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-05
Start Time: 2024-07-01 21:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 1864
--------------------
Sensor ID: ICTMicroclimate-06
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-07
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-08
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-09
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Sensor ID: ICTMicroclimate-10
Start Time: 2024-08-13 00:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 853
--------------------
Sensor ID: ICTMicroclimate-11
Start Time: 2024-08-13 00:00:00+00:00
End Time: 2024-09-05 06:00:00+00:00
Number of Records: 559
--------------------
Sensor ID: aws5-0999
Start Time: 2024-05-29 05:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2672
--------------------
Sensor ID: ems-ce10
Start Time: 2020-06-09 15:00:00+00:00
End Time: 2024-06-17 01:00:00+00:00
Number of Records: 35243
--------------------
Sensor ID: ems-ec8a
Start Time: 2020-06-09 08:00:00+00:00
End Time: 2024-06-24 04:00:00+00:00
Number of Records: 35421
--------------------
In [ ]:
# Create an empty list to store resampled data from all sensors
all_sensors_data = []
for sensor_id, sensor_data in sensor_sequences.items():
  all_sensors_data.append(sensor_data)

# Concatenate all sensor data into a single DataFrame
combined_data = pd.concat(all_sensors_data)

# Group by the time index and calculate the median to get hourly value across all sensors
hourly_data = combined_data.groupby(combined_data.index).median()

# Interpolate missing values linearly (optional)
hourly_data = hourly_data.interpolate(method='linear')

# Print the first few rows of the final hourly data
print(f"Start Time: {hourly_data.index.min()}")
print(f"End Time: {hourly_data.index.max()}")
print(f"Number of Records: {len(hourly_data)}")
print("-" * 20)

hourly_data.head()
Start Time: 2020-06-09 08:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 37469
--------------------
Out[ ]:
pm25 pm10
2020-06-09 08:00:00+00:00 17.0 18.0
2020-06-09 09:00:00+00:00 15.0 18.0
2020-06-09 10:00:00+00:00 16.0 18.5
2020-06-09 11:00:00+00:00 23.0 26.5
2020-06-09 12:00:00+00:00 26.0 29.0
In [ ]:
# Create a figure and axes for the subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 4))

# Plot pm25 on the first subplot
hourly_data['pm25'].plot(kind='line', ax=axes[0], title='pm25')
axes[0].spines[['top', 'right']].set_visible(False)

# Plot pm10 on the second subplot
hourly_data['pm10'].plot(kind='line', ax=axes[1], title='pm10')
axes[1].spines[['top', 'right']].set_visible(False)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()
In [ ]:
# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the scaler to 'pm25' and 'pm10' columns
hourly_data[['pm25', 'pm10']] = scaler.fit_transform(hourly_data[['pm25', 'pm10']])

hourly_data.head()
Out[ ]:
pm25 pm10
2020-06-09 08:00:00+00:00 0.173469 0.161798
2020-06-09 09:00:00+00:00 0.153061 0.161798
2020-06-09 10:00:00+00:00 0.163265 0.166292
2020-06-09 11:00:00+00:00 0.234694 0.238202
2020-06-09 12:00:00+00:00 0.265306 0.260674

Between late August 22, 2023, and February 7, 2024, the data appears to be consistently low, which could suggest either a period of minimal pollution or possible issues with sensor data collection, such as missing data or sensor downtimes. I will keep this in mind when preparing the data and ensure that no samples are taken from this period.

In [ ]:
# Filter out the period from August 22, 2023, to February 7, 2024
hourly_data = hourly_data[(hourly_data.index < '2023-08-22') | (hourly_data.index > '2024-02-07')]

# Function to create sequences of the last 24 hours to predict the next time step
def create_sequences(data, n_steps=24):
  X, y = [], []
  for i in range(len(data) - n_steps):
      X.append(data[i:i + n_steps])
      y.append(data[i + n_steps])
  return np.array(X), np.array(y)

# Use 'pm25' and 'pm10' columns to create sequences
data_values = hourly_data[['pm25', 'pm10']].values
X, y = create_sequences(data_values)
In [ ]:
# Split the data into train (70%), validation (15%), and test sets (15%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, shuffle=False)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=False)

# Print the shapes of the train, validation, and test sets
print(f'Training set shape: {X_train.shape}, {y_train.shape}')
print(f'Validation set shape: {X_valid.shape}, {y_valid.shape}')
print(f'Test set shape: {X_test.shape}, {y_test.shape}')
Training set shape: (23371, 24, 2), (23371, 2)
Validation set shape: (5008, 24, 2), (5008, 2)
Test set shape: (5009, 24, 2), (5009, 2)

Deep Learning Model¶

LSTM¶

In [ ]:
# Define the LSTM model
lstm_model_1 = Sequential()
lstm_model_1.add(Input(shape=(X_train.shape[1], X_train.shape[2])))  # Input layer with shape (24 timesteps, 2 features: pm25 and pm10)
lstm_model_1.add(LSTM(32, activation='relu', return_sequences=True)) # First LSTM layer with 32 units, returning sequences to the next LSTM layer
lstm_model_1.add(LSTM(16, activation='relu', return_sequences=True)) # Second LSTM layer with 16 units, returning sequences to the next LSTM layer
lstm_model_1.add(LSTM(8, activation='relu')) # Third LSTM layer with 8 units, returning only the final output
lstm_model_1.add(Dense(2)) # Output layer with 2 neurons to predict both pm25 and pm10

# Compile the model with Adam optimizer and learning rate of 0.001
lstm_model_1.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
In [ ]:
# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Model checkpoint to save the best model
checkpoint = ModelCheckpoint('best_lstm_model_1.keras', monitor='val_loss', save_best_only=True, mode='min')

# Train the model
lstm_history_1 = lstm_model_1.fit(X_train, y_train,
                                  epochs=100,
                                  validation_data=(X_valid, y_valid),
                                  callbacks=[early_stopping, checkpoint])
Epoch 1/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0826 - val_loss: 0.0121
Epoch 2/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - loss: 0.0104 - val_loss: 0.0022
Epoch 3/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 0.0027 - val_loss: 0.0019
Epoch 4/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0022 - val_loss: 0.0019
Epoch 5/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - loss: 0.0025 - val_loss: 0.0027
Epoch 6/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - loss: 0.0023 - val_loss: 0.0025
Epoch 7/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0022 - val_loss: 0.0016
Epoch 8/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0020 - val_loss: 0.0014
Epoch 9/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 0.0023 - val_loss: 0.0027
Epoch 10/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 0.0031 - val_loss: 0.0017
Epoch 11/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 0.0023 - val_loss: 0.0028
Epoch 12/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0022 - val_loss: 0.0013
Epoch 13/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - loss: 0.0019 - val_loss: 0.0015
Epoch 14/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step - loss: 0.0019 - val_loss: 0.0010
Epoch 15/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0014 - val_loss: 7.1778e-04
Epoch 16/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - loss: 0.0012 - val_loss: 9.3073e-04
Epoch 17/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0012 - val_loss: 7.2092e-04
Epoch 18/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - loss: 0.0011 - val_loss: 7.2257e-04
Epoch 19/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 0.0011 - val_loss: 7.7058e-04
Epoch 20/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0010 - val_loss: 6.7137e-04
Epoch 21/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step - loss: 0.0011 - val_loss: 7.8174e-04
Epoch 22/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step - loss: 9.3796e-04 - val_loss: 0.0012
Epoch 23/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 0.0010 - val_loss: 6.6988e-04
Epoch 24/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - loss: 0.0010 - val_loss: 6.3833e-04
Epoch 25/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 9.7123e-04 - val_loss: 6.4077e-04
Epoch 26/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - loss: 0.0011 - val_loss: 6.4908e-04
Epoch 27/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 9.9949e-04 - val_loss: 6.3411e-04
Epoch 28/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 9.6205e-04 - val_loss: 7.9660e-04
Epoch 29/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 10ms/step - loss: 9.6711e-04 - val_loss: 7.6460e-04
Epoch 30/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - loss: 9.6011e-04 - val_loss: 8.3563e-04
Epoch 31/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 10s 11ms/step - loss: 9.6641e-04 - val_loss: 7.9858e-04
Epoch 32/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - loss: 9.9811e-04 - val_loss: 6.6309e-04
In [ ]:
# Plot the training and validation loss
plt.figure(figsize=(5, 3))
plt.plot(lstm_history_1.history['loss'], label='Train Loss')
plt.plot(lstm_history_1.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss over Epochs (LSTM)')
plt.xlabel('Epochs')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
# Evaluate the model on the test set
test_loss = lstm_model_1.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 3.9764e-04
Test Loss: 0.0004461882926989347
In [ ]:
# Predict on the test set
y_pred = lstm_model_1.predict(X_test)

# Extract the real and predicted values for pm25 and pm10
num_samples = 300
real_pm25 = y_test[:num_samples, 0]
real_pm10 = y_test[:num_samples, 1]
predicted_pm25 = y_pred[:num_samples, 0]
predicted_pm10 = y_pred[:num_samples, 1]

# Plotting the real and predicted values for pm25 and pm10 in two subplots
plt.figure(figsize=(14, 6))

# Subplot 1 for PM2.5
plt.subplot(1, 2, 1)
plt.plot(real_pm25, label='Real PM2.5', color='blue')
plt.plot(predicted_pm25, label='Predicted PM2.5', color='orange', linestyle='--')
plt.title('Real vs Predicted PM2.5 (LSTM)')
plt.xlabel('Time Steps')
plt.ylabel('PM2.5 Value')
plt.legend()
plt.grid(True)

# Subplot 2 for PM10
plt.subplot(1, 2, 2)
plt.plot(real_pm10, label='Real PM10', color='green')
plt.plot(predicted_pm10, label='Predicted PM10', color='red', linestyle='--')
plt.title('Real vs Predicted PM10 (LSTM)')
plt.xlabel('Time Steps')
plt.ylabel('PM10 Value')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step

GRU¶

In [ ]:
# Define the GRU model
gru_model_1 = Sequential()
gru_model_1.add(Input(shape=(X_train.shape[1], X_train.shape[2])))  # Input layer with shape (24 timesteps, 2 features: pm25 and pm10)
gru_model_1.add(GRU(32, activation='relu', return_sequences=True)) # First LSTM layer with 32 units, returning sequences to the next LSTM layer
gru_model_1.add(GRU(16, activation='relu', return_sequences=True)) # Second LSTM layer with 16 units, returning sequences to the next LSTM layer
gru_model_1.add(GRU(8, activation='relu')) # Third LSTM layer with 8 units, returning only the final output
gru_model_1.add(Dense(2)) # Output layer with 2 neurons to predict both pm25 and pm10

# Compile the model with Adam optimizer and learning rate of 0.001
gru_model_1.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
In [ ]:
# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Model checkpoint to save the best model
checkpoint = ModelCheckpoint('best_gru_model_1.keras', monitor='val_loss', save_best_only=True, mode='min')

# Train the model
gru_history_1 = gru_model_1.fit(X_train, y_train,
                                  epochs=100,
                                  validation_data=(X_valid, y_valid),
                                  callbacks=[early_stopping, checkpoint])
Epoch 1/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 22s 21ms/step - loss: 0.0028 - val_loss: 6.4584e-04
Epoch 2/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.2629e-04 - val_loss: 6.4507e-04
Epoch 3/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - loss: 8.7834e-04 - val_loss: 6.6083e-04
Epoch 4/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - loss: 9.7132e-04 - val_loss: 6.6739e-04
Epoch 5/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - loss: 9.3562e-04 - val_loss: 6.3789e-04
Epoch 6/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.6577e-04 - val_loss: 6.5164e-04
Epoch 7/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - loss: 9.2632e-04 - val_loss: 6.4041e-04
Epoch 8/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 0.0010 - val_loss: 6.2361e-04
Epoch 9/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.3825e-04 - val_loss: 6.6791e-04
Epoch 10/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - loss: 9.3960e-04 - val_loss: 6.3782e-04
Epoch 11/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.2917e-04 - val_loss: 6.8973e-04
Epoch 12/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.1833e-04 - val_loss: 6.2344e-04
Epoch 13/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - loss: 9.1694e-04 - val_loss: 6.2064e-04
Epoch 14/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - loss: 9.0261e-04 - val_loss: 6.9610e-04
Epoch 15/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.2799e-04 - val_loss: 7.0163e-04
Epoch 16/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 9.4980e-04 - val_loss: 8.3384e-04
Epoch 17/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 8.7773e-04 - val_loss: 6.4464e-04
Epoch 18/100
731/731 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - loss: 8.7739e-04 - val_loss: 6.6473e-04
In [ ]:
# Plot the training and validation loss
plt.figure(figsize=(5, 3))
plt.plot(gru_history_1.history['loss'], label='Train Loss')
plt.plot(gru_history_1.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss over Epochs (GRU)')
plt.xlabel('Epochs')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
# Evaluate the model on the test set
test_loss = gru_model_1.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 3.6584e-04
Test Loss: 0.000416912225773558
In [ ]:
# Predict on the test set
y_pred = gru_model_1.predict(X_test)

# Extract the real and predicted values for pm25 and pm10
num_samples = 300
real_pm25 = y_test[:num_samples, 0]
real_pm10 = y_test[:num_samples, 1]
predicted_pm25 = y_pred[:num_samples, 0]
predicted_pm10 = y_pred[:num_samples, 1]

# Plotting the real and predicted values for pm25 and pm10 in two subplots
plt.figure(figsize=(14, 6))

# Subplot 1 for PM2.5
plt.subplot(1, 2, 1)
plt.plot(real_pm25, label='Real PM2.5', color='blue')
plt.plot(predicted_pm25, label='Predicted PM2.5', color='orange', linestyle='--')
plt.title('Real vs Predicted PM2.5 (GRU)')
plt.xlabel('Time Steps')
plt.ylabel('PM2.5 Value')
plt.legend()
plt.grid(True)

# Subplot 2 for PM10
plt.subplot(1, 2, 2)
plt.plot(real_pm10, label='Real PM10', color='green')
plt.plot(predicted_pm10, label='Predicted PM10', color='red', linestyle='--')
plt.title('Real vs Predicted PM10 (GRU)')
plt.xlabel('Time Steps')
plt.ylabel('PM10 Value')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step

The LSTM and GRU models both perform well in predicting future PM2.5 and PM10 values based on historical data, with LSTM showing slightly better accuracy in capturing sharp spikes, particularly for PM10, while GRU exhibits more deviation during peaks. However, GRU has a marginally lower test loss (0.0004169 vs. LSTM's 0.0004462), indicating it made fewer overall errors in prediction. While LSTM might be preferable for scenarios where sharp fluctuations in air quality are critical, GRU's efficiency and lower complexity make it a strong alternative, especially for general use cases. The choice between the two depends on whether accuracy in extreme events or computational efficiency is prioritized.

Method 2¶

Data Wrangling¶

The goal of this approach is to predict future air quality by focusing on the most important features related to PM2.5 and PM10: Atmospheric Pressure, Air Temperature, Relative Humidity, Noise, and Gust Wind Speed. To achieve this, we will use the Microclimate Sensor dataset, which contains readings from various sensors across Melbourne. The first step is to resample the data from each sensor to hourly intervals to align all the readings on a consistent time grid. After resampling, we will aggregate the values from all sensors to get an overall view of Melbourne's air quality by calculating the median of each feature across all sensors for each hour. Finally, any missing time points will be estimated using linear interpolation to ensure a complete dataset for analysis. This processed dataset will then be used to predict future air quality based on historical data.

In [ ]:
# Group the data by device_id
sensor_groups = microclimate_sensors_df.groupby('device_id')

# Prepare an empty dictionary to store sequences for each sensor
sensor_sequences = {}

# Iterate over each sensor group and process data independently
for sensor_id, sensor_data in sensor_groups:
  # Select only relevant columns for LSTM model (Time, PM25, PM10)
  sensor_data = sensor_data[['received_at', 'pm25', 'pm10', 'atmosphericpressure', 'relativehumidity', 'airtemperature', 'gustwindspeed', 'noise']]

  # Sort by time to ensure proper time series ordering
  sensor_data = sensor_data.sort_values(by='received_at')

  # Set 'Time' as the index (optional, but useful for resampling)
  sensor_data.set_index('received_at', inplace=True)

  # Resample to a consistent interval (e.g., hourly) if necessary
  sensor_data = sensor_data.resample('H').median()

  # Find the missing hours by comparing the full range with your dataset's 'Time' column
  missing_hours = pd.date_range(start=sensor_data.index.min(), end=sensor_data.index.max(), freq='H').difference(sensor_data.index)

  # Columns with missing values and the count of missing values
  missing_values = sensor_data.isnull().sum()
  missing_columns = missing_values[missing_values > 0]

  # Check the missing hours or values
  if len(missing_hours) != 0 and not missing_columns.empty:
      print("There are some missing time or value")

  # Store the sequences for this sensor
  sensor_sequences[sensor_id] = sensor_data

print(sensor_sequences['ICTMicroclimate-09'].info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2674 entries, 2024-05-29 03:00:00+00:00 to 2024-09-17 12:00:00+00:00
Freq: H
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pm25                 2673 non-null   float64
 1   pm10                 2673 non-null   float64
 2   atmosphericpressure  2673 non-null   float64
 3   relativehumidity     2673 non-null   float64
 4   airtemperature       2673 non-null   float64
 5   gustwindspeed        2673 non-null   float64
 6   noise                2673 non-null   float64
dtypes: float64(7)
memory usage: 167.1 KB
None
In [ ]:
# Create an empty list to store resampled data from all sensors
all_sensors_data = []
for sensor_id, sensor_data in sensor_sequences.items():
  all_sensors_data.append(sensor_data)

# Concatenate all sensor data into a single DataFrame
combined_data = pd.concat(all_sensors_data)

# Group by the time index and calculate the median to get hourly value across all sensors
hourly_data = combined_data.groupby(combined_data.index).median()

# Interpolate missing values linearly (optional)
hourly_data = hourly_data.interpolate(method='linear')

# Print the first few rows of the final hourly data
print(f"Start Time: {hourly_data.index.min()}")
print(f"End Time: {hourly_data.index.max()}")
print(f"Number of Records: {len(hourly_data)}")
print("-" * 20)

hourly_data.head()
Start Time: 2024-05-29 03:00:00+00:00
End Time: 2024-09-17 12:00:00+00:00
Number of Records: 2674
--------------------
Out[ ]:
pm25 pm10 atmosphericpressure relativehumidity airtemperature gustwindspeed noise
received_at
2024-05-29 03:00:00+00:00 12.5 14.00 1019.399994 43.900 19.40 7.350 71.350
2024-05-29 04:00:00+00:00 12.0 14.25 1019.149988 43.775 19.45 6.525 71.175
2024-05-29 05:00:00+00:00 12.0 14.00 1018.800018 42.900 19.55 5.800 70.900
2024-05-29 06:00:00+00:00 10.5 11.00 1018.500000 44.900 18.75 4.700 71.300
2024-05-29 07:00:00+00:00 12.5 14.50 1018.649994 48.350 17.80 3.900 71.350
In [ ]:
# Create a figure and axes for the subplots
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Flatten the axes array to easily iterate over subplots
axes = axes.flatten()

# Plot pm25 on the first subplot
hourly_data['pm25'].plot(kind='line', ax=axes[0], title='pm25')
axes[0].spines[['top', 'right']].set_visible(False)

# Plot noise on the second subplot
hourly_data['pm10'].plot(kind='line', ax=axes[1], title='pm10')
axes[1].spines[['top', 'right']].set_visible(False)

# Plot atmosphericpressure on the second subplot
hourly_data['atmosphericpressure'].plot(kind='line', ax=axes[2], title='atmosphericpressure')
axes[2].spines[['top', 'right']].set_visible(False)

# Plot relativehumidity on the second subplot
hourly_data['relativehumidity'].plot(kind='line', ax=axes[3], title='relativehumidity')
axes[3].spines[['top', 'right']].set_visible(False)

# Plot airtemperature on the second subplot
hourly_data['airtemperature'].plot(kind='line', ax=axes[4], title='airtemperature')
axes[4].spines[['top', 'right']].set_visible(False)

# Plot gustwindspeed on the second subplot
hourly_data['gustwindspeed'].plot(kind='line', ax=axes[5], title='gustwindspeed')
axes[5].spines[['top', 'right']].set_visible(False)

# Plot noise on the second subplot
hourly_data['noise'].plot(kind='line', ax=axes[6], title='noise')
axes[6].spines[['top', 'right']].set_visible(False)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()
In [ ]:
# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the scaler to 'pm25' and 'pm10' columns
hourly_data[['pm25', 'pm10', 'atmosphericpressure', 'relativehumidity', 'airtemperature', 'gustwindspeed', 'noise']] = scaler.fit_transform(hourly_data[['pm25', 'pm10', 'atmosphericpressure', 'relativehumidity', 'airtemperature', 'gustwindspeed', 'noise']])

hourly_data.head()
Out[ ]:
pm25 pm10 atmosphericpressure relativehumidity airtemperature gustwindspeed noise
received_at
2024-05-29 03:00:00+00:00 0.238095 0.190476 0.582227 0.266846 0.810624 0.726027 0.549020
2024-05-29 04:00:00+00:00 0.228571 0.194139 0.577120 0.265162 0.812933 0.635616 0.542157
2024-05-29 05:00:00+00:00 0.228571 0.190476 0.569970 0.253369 0.817552 0.556164 0.531373
2024-05-29 06:00:00+00:00 0.200000 0.146520 0.563841 0.280323 0.780600 0.435616 0.547059
2024-05-29 07:00:00+00:00 0.238095 0.197802 0.566905 0.326819 0.736721 0.347945 0.549020
In [ ]:
# Function to create sequences of the last 24 hours to predict the next time step
def create_sequences(data, n_steps=24):
  X, y = [], []
  for i in range(len(data) - n_steps):
      X.append(data[i:i + n_steps])
      y.append(data[i + n_steps][:2])
  return np.array(X), np.array(y)

# Use 'pm25' and 'pm10' columns to create sequences
data_values = hourly_data[['pm25', 'pm10', 'atmosphericpressure', 'relativehumidity', 'airtemperature', 'gustwindspeed', 'noise']].values
X, y = create_sequences(data_values)
In [ ]:
# Split the data into train (80%), validation (10%), and test sets (10%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, shuffle=False)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=False)

# Print the shapes of the train, validation, and test sets
print(f'Training set shape: {X_train.shape}, {y_train.shape}')
print(f'Validation set shape: {X_valid.shape}, {y_valid.shape}')
print(f'Test set shape: {X_test.shape}, {y_test.shape}')
Training set shape: (2120, 24, 7), (2120, 2)
Validation set shape: (265, 24, 7), (265, 2)
Test set shape: (265, 24, 7), (265, 2)

Deep Learning Model¶

LSTM¶

In [ ]:
# Define the LSTM model
lstm_model_2 = Sequential()
lstm_model_2.add(Input(shape=(X_train.shape[1], X_train.shape[2])))  # Input layer with shape (24 timesteps, 7 features)
lstm_model_2.add(LSTM(32, activation='relu', return_sequences=True)) # First LSTM layer with 32 units, returning sequences to the next LSTM layer
lstm_model_2.add(LSTM(16, activation='relu', return_sequences=True)) # Second LSTM layer with 16 units, returning sequences to the next LSTM layer
lstm_model_2.add(LSTM(8, activation='relu')) # Third LSTM layer with 8 units, returning only the final output
lstm_model_2.add(Dense(2)) # Output layer with 2 neurons to predict both pm25 and pm10

# Compile the model with Adam optimizer and learning rate of 0.001
lstm_model_2.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
In [ ]:
# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Model checkpoint to save the best model
checkpoint = ModelCheckpoint('best_lstm_model_2.keras', monitor='val_loss', save_best_only=True, mode='min')

# Train the model
lstm_history_2 = lstm_model_2.fit(X_train, y_train,
                                  epochs=100,
                                  validation_data=(X_valid, y_valid),
                                  callbacks=[early_stopping, checkpoint])
Epoch 1/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 14s 105ms/step - loss: 0.0451 - val_loss: 0.0014
Epoch 2/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0138 - val_loss: 8.5071e-04
Epoch 3/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0093 - val_loss: 7.3592e-04
Epoch 4/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0077 - val_loss: 8.2320e-04
Epoch 5/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0074 - val_loss: 8.1421e-04
Epoch 6/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0066 - val_loss: 0.0017
Epoch 7/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0047 - val_loss: 6.6603e-04
Epoch 8/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0038 - val_loss: 6.3818e-04
Epoch 9/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0044 - val_loss: 5.9249e-04
Epoch 10/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0039 - val_loss: 7.3337e-04
Epoch 11/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0036 - val_loss: 7.1967e-04
Epoch 12/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0026 - val_loss: 4.9864e-04
Epoch 13/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0033 - val_loss: 6.3739e-04
Epoch 14/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0024 - val_loss: 4.9079e-04
Epoch 15/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0029 - val_loss: 5.7002e-04
Epoch 16/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0026 - val_loss: 7.7015e-04
Epoch 17/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0023 - val_loss: 5.1666e-04
Epoch 18/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0020 - val_loss: 5.6037e-04
Epoch 19/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0021 - val_loss: 4.7819e-04
Epoch 20/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0023 - val_loss: 4.1518e-04
Epoch 21/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0026 - val_loss: 6.2422e-04
Epoch 22/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0024 - val_loss: 5.3881e-04
Epoch 23/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0022 - val_loss: 4.4096e-04
Epoch 24/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0021 - val_loss: 5.1176e-04
Epoch 25/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0021 - val_loss: 5.3727e-04
In [ ]:
# Plot the training and validation loss
plt.figure(figsize=(5, 3))
plt.plot(lstm_history_2.history['loss'], label='Train Loss')
plt.plot(lstm_history_2.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss over Epochs (LSTM)')
plt.xlabel('Epochs')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
# Evaluate the model on the test set
test_loss = lstm_model_2.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 9.9719e-04
Test Loss: 0.0009264189284294844
In [ ]:
# Predict on the test set
y_pred = lstm_model_2.predict(X_test)

# Extract the real and predicted values for pm25 and pm10
real_pm25 = y_test[:, 0]
real_pm10 = y_test[:, 1]
predicted_pm25 = y_pred[:, 0]
predicted_pm10 = y_pred[:, 1]

# Plotting the real and predicted values for pm25 and pm10 in two subplots
plt.figure(figsize=(14, 6))

# Subplot 1 for PM2.5
plt.subplot(1, 2, 1)
plt.plot(real_pm25, label='Real PM2.5', color='blue')
plt.plot(predicted_pm25, label='Predicted PM2.5', color='orange', linestyle='--')
plt.title('Real vs Predicted PM2.5 (LSTM)')
plt.xlabel('Time Steps')
plt.ylabel('PM2.5 Value')
plt.legend()
plt.grid(True)

# Subplot 2 for PM10
plt.subplot(1, 2, 2)
plt.plot(real_pm10, label='Real PM10', color='green')
plt.plot(predicted_pm10, label='Predicted PM10', color='red', linestyle='--')
plt.title('Real vs Predicted PM10 (LSTM)')
plt.xlabel('Time Steps')
plt.ylabel('PM10 Value')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step 

GRU¶

In [ ]:
# Define the GRU model
gru_model_2 = Sequential()
gru_model_2.add(Input(shape=(X_train.shape[1], X_train.shape[2])))  # Input layer with shape (24 timesteps, 7 features)
gru_model_2.add(GRU(32, activation='relu', return_sequences=True)) # First LSTM layer with 32 units, returning sequences to the next LSTM layer
gru_model_2.add(GRU(16, activation='relu', return_sequences=True)) # Second LSTM layer with 16 units, returning sequences to the next LSTM layer
gru_model_2.add(GRU(8, activation='relu')) # Third LSTM layer with 8 units, returning only the final output
gru_model_2.add(Dense(2)) # Output layer with 2 neurons to predict both pm25 and pm10

# Compile the model with Adam optimizer and learning rate of 0.001
gru_model_2.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
In [ ]:
# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Model checkpoint to save the best model
checkpoint = ModelCheckpoint('best_gru_model_2.keras', monitor='val_loss', save_best_only=True, mode='min')

# Train the model
gru_history_2 = gru_model_2.fit(X_train, y_train,
                                  epochs=100,
                                  validation_data=(X_valid, y_valid),
                                  callbacks=[early_stopping, checkpoint])
Epoch 1/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 7s 101ms/step - loss: 0.0398 - val_loss: 7.9680e-04
Epoch 2/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 5s 28ms/step - loss: 0.0181 - val_loss: 6.7085e-04
Epoch 3/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 0.0036 - val_loss: 7.3469e-04
Epoch 4/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 17ms/step - loss: 0.0034 - val_loss: 6.0669e-04
Epoch 5/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.0023 - val_loss: 4.8808e-04
Epoch 6/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - loss: 0.0023 - val_loss: 4.2053e-04
Epoch 7/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.0021 - val_loss: 3.7794e-04
Epoch 8/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.0018 - val_loss: 3.9725e-04
Epoch 9/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - loss: 0.0020 - val_loss: 4.2602e-04
Epoch 10/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.0020 - val_loss: 4.2271e-04
Epoch 11/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - loss: 0.0021 - val_loss: 3.9086e-04
Epoch 12/100
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 18ms/step - loss: 0.0021 - val_loss: 3.9553e-04
In [ ]:
# Plot the training and validation loss
plt.figure(figsize=(5, 3))
plt.plot(gru_history_2.history['loss'], label='Train Loss')
plt.plot(gru_history_2.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss over Epochs (GRU)')
plt.xlabel('Epochs')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
# Evaluate the model on the test set
test_loss = gru_model_2.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 9.8798e-04 
Test Loss: 0.0010076325852423906
In [ ]:
# Predict on the test set
y_pred = gru_model_2.predict(X_test)

# Extract the real and predicted values for pm25 and pm10
num_samples = 300
real_pm25 = y_test[:num_samples, 0]
real_pm10 = y_test[:num_samples, 1]
predicted_pm25 = y_pred[:num_samples, 0]
predicted_pm10 = y_pred[:num_samples, 1]

# Plotting the real and predicted values for pm25 and pm10 in two subplots
plt.figure(figsize=(14, 6))

# Subplot 1 for PM2.5
plt.subplot(1, 2, 1)
plt.plot(real_pm25, label='Real PM2.5', color='blue')
plt.plot(predicted_pm25, label='Predicted PM2.5', color='orange', linestyle='--')
plt.title('Real vs Predicted PM2.5 (GRU)')
plt.xlabel('Time Steps')
plt.ylabel('PM2.5 Value')
plt.legend()
plt.grid(True)

# Subplot 2 for PM10
plt.subplot(1, 2, 2)
plt.plot(real_pm10, label='Real PM10', color='green')
plt.plot(predicted_pm10, label='Predicted PM10', color='red', linestyle='--')
plt.title('Real vs Predicted PM10 (GRU)')
plt.xlabel('Time Steps')
plt.ylabel('PM10 Value')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
9/9 ━━━━━━━━━━━━━━━━━━━━ 2s 149ms/step

In the second approach, where both environmental factors (e.g., temperature, humidity, wind) and historical PM2.5 and PM10 data were used to predict future air quality, the LSTM model outperformed the GRU model. The LSTM model had a lower test loss (0.0009264 vs. GRU's 0.0010076) and provided better alignment with the real PM2.5 and PM10 values, particularly around significant spikes and gradual fluctuations. While both models captured overall trends well, the GRU showed more deviation during complex changes, making the LSTM model more accurate and reliable in this scenario. Overall, incorporating additional environmental factors improved both models' performance, but LSTM had the edge in terms of prediction accuracy.

Comparison of Best Results for Each Approach¶

When comparing the two approaches, the GRU model performed best when using only PM2.5 and PM10 historical data, achieving a lower test loss (0.0004169) than the LSTM model. However, when additional environmental factors like temperature, humidity, and wind were included, the LSTM model outperformed the GRU, handling complex fluctuations and spikes more accurately despite a slightly higher test loss (0.0009264). It's important to note that the dataset and the number of records available for training the model using the second approach (with environmental factors) were significantly smaller compared to the first approach, which may have influenced the performance and overall results.

Overall, GRU is better for simpler models using only historical air quality data, while LSTM is more effective when environmental factors are integrated for more comprehensive predictions.